Introduction

In this part of the study, for initial modeling and analysis, I will look at the total number of reported thefts from January 2001 through March 11, 2023. Note that due to the large size of the original dataset (nearly 8 million rows), the raw data is not included in this repository. The raw data can be accessed here. GitHub repository, including the code for deriving the variables: here.

Data Overview

For this partition of the data, there are two variables: year/month and the number of thefts reported in each month. The full dataset has more variables, which are described below. Each row in the full dataset represents an individual crime that was reported.

Variables

ID Unique identifier for the record.
Case Number The Chicago Police Department case number.
Date Date when the incident occurred; this is sometimes a best estimate.
Block The partially redacted address where the incident occurred.
IUCR The Illinois Uniform Crime Reporting Code.
Primary Type The primary description of the IUCR code.
Description The secondary description of the IUCR code.
Location description The primary description of the location where the incident occurred.
Arrest Whether or not the incident resulted in an arrest.
Domestic Whether or not the incident was a domestic incident.
Beat Indicates the beat where the incident occurred.
District The police district where the incident occurred.
Ward The city council district where the incident occurred.
Community Area The community area where the incident occurred.
FBI Code FBI Code crime classification.
X Coordinate The X coordinate location where the incident occurred.
Y coordinate The Y coordinate where the incident occurred.
Year Year the incident occurred.
Updated on Date and time the record was last updated.
Latitude The latitude where the incident occurred.
Longitude The longitude where the incident occurred.
Location The location of the incident.

Population Data Overview

The population data for this project was extracted using the US Census ACS API, which provided population data for Cook County from 2005-2019. Since Cook County forms the core of the Chicago metropolitan area, I chose to extract this data and include a variable that scales the number of thefts over time by the yearly population. So far, I haven't been able to track down monthly population data for the Chicago area. Note also that the data from 2001-2005 is harder to access, as it is only available through the Population Estimates API. Therefore, although the original dataset spans 2001-2023, I am focusing on 2005-2019 at this time. The population data from the Chicago website is limited to 2018-2022.
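The scaling step described above can be sketched as a join of the yearly population onto the monthly theft counts. This is a minimal illustration, not the project's actual code: the column names (Year, Population, NumThefts) and the toy values are assumptions.

```r
# Sketch of scaling monthly theft counts by yearly population.
# Toy data stands in for the real Census and crime extracts.
library(dplyr)

population <- tibble(Year = c(2005, 2006),
                     Population = c(5300000, 5310000))
monthly_thefts <- tibble(Year = c(2005, 2005, 2006),
                         NumThefts = c(6000, 5800, 6100))

# Join the yearly population onto the monthly counts, then scale to
# thefts per 10,000 residents
monthly_thefts %>%
  left_join(population, by = "Year") %>%
  mutate(NumTheftsPer10000 = NumThefts / Population * 10000)
```

Because the population is only yearly, every month within a year is scaled by the same denominator.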

Unemployment Data Overview

The unemployment data is from the Illinois government website and focuses specifically on the Chicago metropolitan area. Although there are several variables related to Unemployment, for now, I will only focus on the unemployment rate for Chicago. Future work may extend this to more variables, such as the labor force participation rate.

thefts <- read_csv("data/thefts_unemployment_population.csv") %>%
  mutate(Month = yearmonth(Month)) %>%
  as_tsibble(index = Month)
## Rows: 144 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Month
## dbl (10): NumThefts, Population, NumTheftsPer10000, Labor Force, Labor Force...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(thefts)
## # A tsibble: 6 x 11 [1M]
##   NumThefts    Month Population NumThe…¹ Labor…² Labor…³ Emplo…⁴ Emplo…⁵ Unemp…⁶
##       <dbl>    <mth>      <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1      6075 2010 Jan    5394954    11.3  3735500    65.8 3280400    57.8  455100
## 2      4803 2010 Feb    5394954     8.90 3765000    66.3 3317200    58.5  447800
## 3      6085 2010 Mar    5394954    11.3  3772300    66.5 3341200    58.9  431100
## 4      6135 2010 Apr    5394954    11.4  3783300    67.3 3361700    59.8  421600
## 5      6722 2010 May    5394954    12.5  3756900    66.8 3358500    59.7  398500
## 6      6912 2010 Jun    5394954    12.8  3800600    67.5 3383000    60.1  417600
## # … with 2 more variables: `Unemployment Rate` <dbl>, `IL Rate` <dbl>, and
## #   abbreviated variable names ¹​NumTheftsPer10000, ²​`Labor Force`,
## #   ³​`Labor Force Participation Rate`, ⁴​Employed,
## #   ⁵​`EmploymentParticipation Rate`, ⁶​Unemployed

Data Exploration

Time Series Plots of Variables

Raw Number of Thefts - Original

From the original number of thefts, there appears to be a decreasing trend, although not quite a linear one. There doesn't appear to be a cycle, although it's possible that with more data there would be one. There is a clear seasonal pattern of a peak in summer followed by a dip in January. Although the data is generally decreasing, there is a very sharp decrease in the number of thefts since 2020, likely related to the pandemic. The data does not have constant variance: there appears to be less variance as the series continues.

lambda <- thefts |>
  features(NumThefts, features = guerrero) |>
  pull(lambda_guerrero)
lambda
## [1] 0.6609234
thefts %>%
  autoplot(NumThefts)

Raw Number of Thefts - Box Cox Transformation

With a Box-Cox transformation, the data appears to have more stable variance. The seasonal pattern is still prominent, as is the general downward trend, with a steep drop over 2020-2022.
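For reference, the Box-Cox transformation (in its standard form for positive values, which is what I understand fpp3's box_cox() to compute) is:

```latex
w_t =
\begin{cases}
\log(y_t) & \text{if } \lambda = 0, \\[4pt]
\dfrac{y_t^{\lambda} - 1}{\lambda} & \text{otherwise.}
\end{cases}
```

With the estimated $\lambda \approx 0.68$, the transform sits between a square root ($\lambda = 0.5$) and no transformation ($\lambda = 1$), mildly compressing larger values to stabilize the variance.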

lambda <- thefts |>
  features(NumTheftsPer10000, features = guerrero) |>
  pull(lambda_guerrero)
lambda
## [1] 0.6775376
thefts <- thefts  %>%
  mutate(TheftsPer10000Transformed = box_cox(NumTheftsPer10000, lambda))
thefts <- thefts %>%
  mutate(NumTheftsTransformed = box_cox(NumThefts, lambda))
thefts %>%
  autoplot(NumTheftsTransformed)

thefts %>%
  gg_season(NumThefts)

All years appear to have a very strong seasonal pattern, although the scale is significantly lower in 2021 and 2022 and higher in 2012 and 2015. Note that in a later plot, we can see that the population of Chicago has been decreasing over time, which may partially explain this. Typically, there is a drop in thefts in the winter and a peak around late spring through early fall, with the count highest in August.

thefts %>%
  gg_subseries(NumThefts)

This pattern is easier to see in the subseries plot, where clearly the number of thefts is highest between June and August and lowest in February. Throughout all of the plots, we can see the number of thefts is decreasing over time, though note that the pandemic may be partially responsible, as well as the decline in population.

Autocorrelation of the Number of Thefts

In both the transformed and the original data, the autocorrelation is in general strictly positive, following a decreasing, then increasing, then decreasing pattern across the lags. Almost all of the positive autocorrelations are strongly statistically significant, while the few negative ones are not. The transformed number of thefts follows the same pattern, but to a lesser extent.

ACF Plot of the Number of Thefts

thefts %>%
  ACF(NumThefts) %>%
  autoplot()

ACF Plot of the Box-Cox Transformed Number of Thefts

thefts %>%
  ACF(NumTheftsTransformed) %>%
  autoplot()

Number of Thefts Per 10,000 Chicago residents

thefts %>%
  autoplot(NumTheftsPer10000)

thefts %>%
  autoplot(TheftsPer10000Transformed)

Although the transformed number of thefts per 10,000 residents follows a similar decreasing pattern, it's less extreme than in the non-transformed data. It is highest around 2010, again with a sharp dip each winter and a peak each summer.

thefts %>%
  gg_season(NumTheftsPer10000)

thefts %>%
  gg_subseries(NumTheftsPer10000)

Plotting the Population Over the 10-Year Period

There appears to be a strong quadratic trend to the population data from 2010-2022. There is no cyclicity, and since this is annual data, there is also no seasonality. The variance appears fairly constant, so I don't think a Box-Cox transformation is necessary.

population <- read_csv("data/all_chicago_pop_estimates.csv") %>%
  as_tsibble(index = Year)
## Rows: 10 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Year, Population
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
population %>%
  autoplot(Population)

Models

Initially, I will test the four simple benchmark models we've covered in class: naive, seasonal naive, mean, and random walk with drift. There appears to be a stronger seasonal component than trend component, so my guess is that the seasonal naive will perform best, but I will evaluate all four models against a held-out test set.
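For reference, the point forecasts produced by these four benchmark methods, with $y_1, \dots, y_T$ the training series, $m = 12$ the seasonal period, and $k = \lfloor (h-1)/m \rfloor$, are the standard ones:

```latex
\begin{aligned}
\text{mean:} \quad & \hat{y}_{T+h|T} = \frac{1}{T}\sum_{t=1}^{T} y_t \\
\text{naive:} \quad & \hat{y}_{T+h|T} = y_T \\
\text{seasonal naive:} \quad & \hat{y}_{T+h|T} = y_{T+h-m(k+1)} \\
\text{drift:} \quad & \hat{y}_{T+h|T} = y_T + h \cdot \frac{y_T - y_1}{T - 1}
\end{aligned}
```

The seasonal naive forecast repeats the last observed value from the same month, which is why it can capture the summer peak that the other three methods ignore.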

Raw Count of Thefts

Forecasting Number of Thefts for 2019-2021

Forecasting the number of thefts gives somewhat mixed results. Although there is a clear seasonal pattern, with the number of thefts increasing during the summer, the pandemic in the last few years substantially changed the results. However, I will still try to predict the future with all four models from class, first with a training/test split, then with cross validation.

Model

thefts_train <- thefts %>%
  filter(year(Month) < 2019)
raw_count.model <- thefts_train %>%
  model(
    naive = NAIVE(NumThefts),
    snaive = SNAIVE(NumThefts),
    mean = MEAN(NumThefts),
    lm = RW(NumThefts ~ drift())
  )
raw_count.model.forecast <- raw_count.model %>%
  forecast(h = "3 years")
raw_count.model.forecast %>%
  autoplot(filter(thefts, year(Month) >= 2015))

raw_count.model.forecast %>%
  accuracy(thefts)
## # A tibble: 4 × 10
##   .model .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm     Test  -1608. 1899. 1655. -49.8  50.6  4.43  4.07 0.906
## 2 mean   Test  -1624. 1924. 1674. -50.4  51.2  4.48  4.12 0.908
## 3 naive  Test  -1675. 1968. 1719. -51.7  52.5  4.60  4.21 0.908
## 4 snaive Test  -1424. 1773. 1467. -43.4  44.4  3.92  3.80 0.876

The root mean squared error is fairly high for all of these models. Note that the scale of the data is around 5000 thefts per month, so the RMSE is very high relative to the data. The best performing model is the seasonal naive, which is the only one that captures the seasonal pattern.

Secondly, I will look at the plot of the residuals of the number of thefts.

raw_count.model <- thefts_train %>%
  model(
    snaive = SNAIVE(NumTheftsTransformed)
  )
raw_count.model.forecast <- raw_count.model %>%
  forecast(h = "3 years")
raw_count.model.forecast %>%
  autoplot(thefts)

raw_count.model %>%
  gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

There is strong autocorrelation in the ACF plot. The residuals appear to be left skewed, so not normally distributed. The residuals do not appear to be white noise.
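To quantify the visual impression that these residuals are not white noise, a Ljung-Box test can be applied to the innovation residuals. This is a sketch building on the raw_count.model mable fitted above; lag = 24 (two seasonal periods) is one common choice for monthly data, not the only defensible one.

```r
# Ljung-Box test on the innovation residuals of the seasonal naive model.
# A small p-value indicates the residuals are autocorrelated (not white
# noise), which would agree with the ACF plot.
raw_count.model %>%
  augment() %>%
  features(.innov, ljung_box, lag = 24)
```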

Plot

raw_count.model.forecast %>%
  autoplot(thefts)

Forecasting Number of Thefts Per Capita

percapita.model <- thefts_train %>%
  model(
    naive = NAIVE(NumTheftsPer10000),
    snaive = SNAIVE(NumTheftsPer10000),
    mean = MEAN(NumTheftsPer10000),
    lm = RW(NumTheftsPer10000 ~ drift())
  )
percapita.model.forecast <- percapita.model %>%
  forecast(h = "3 years")
percapita.model.forecast %>%
  autoplot(thefts)

percapita.model.forecast %>%
  accuracy(thefts)
## # A tibble: 4 × 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm     Test  -2.89  3.43  2.98 -47.9  48.7  4.26  3.96 0.903
## 2 mean   Test  -2.87  3.44  2.98 -47.8  48.8  4.26  3.97 0.906
## 3 naive  Test  -3.02  3.56  3.10 -49.9  50.6  4.43  4.11 0.906
## 4 snaive Test  -2.55  3.21  2.66 -41.5  42.8  3.80  3.70 0.874

We see similar results with the scaled thefts; however, as we would expect with the scaling, the shape of the series is slightly different: the scale is much smaller, and there also appears to be less variance. Again, none of the models fit the data especially well, although the seasonal naive does the best, and all of the models have relatively high errors in proportion to the data.

Transformed Data

Since the data for the number of thefts per 10,000 residents of Chicago needed a Box-Cox transformation, I will test whether that improved the model performance.

percapita.model.t <- thefts_train %>%
  model(
    naive = NAIVE(TheftsPer10000Transformed),
    snaive = SNAIVE(TheftsPer10000Transformed),
    mean = MEAN(TheftsPer10000Transformed),
    lm = RW(TheftsPer10000Transformed ~ drift())
  )
percapita.model.forecast <- percapita.model.t %>%
  forecast(h = "3 years")
percapita.model.forecast %>%
  autoplot(thefts)

percapita.model.forecast %>%
  accuracy(thefts)
## # A tibble: 4 × 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm     Test  -1.47  1.76  1.51 -42.4  43.0  4.53  4.25 0.899
## 2 mean   Test  -1.44  1.75  1.49 -41.7  42.6  4.48  4.22 0.902
## 3 naive  Test  -1.53  1.82  1.57 -43.9  44.6  4.70  4.40 0.902
## 4 snaive Test  -1.30  1.63  1.35 -36.7  37.9  4.05  3.95 0.878

Tables

percapita.model.forecast %>%
  accuracy(thefts)
## # A tibble: 4 × 10
##   .model .type    ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm     Test  -1.47  1.76  1.51 -42.4  43.0  4.53  4.25 0.899
## 2 mean   Test  -1.44  1.75  1.49 -41.7  42.6  4.48  4.22 0.902
## 3 naive  Test  -1.53  1.82  1.57 -43.9  44.6  4.70  4.40 0.902
## 4 snaive Test  -1.30  1.63  1.35 -36.7  37.9  4.05  3.95 0.878

Residuals of the Best Method

As I expected, the best performing model across all measures is the seasonal naive model. Now, I will look at the residuals to see whether they resemble white noise. There is still significant autocorrelation at many lags, although fewer lags are significant than before. The residuals are more clearly left skewed here.

percapita <- thefts %>%
  model(
    snaive = SNAIVE(NumTheftsPer10000)
  )
percapita %>%
  gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

raw_count.t.snaive <- thefts %>%
  model(
    snaive = SNAIVE(TheftsPer10000Transformed)
  )
raw_count.t.snaive %>%
  gg_tsresiduals()
## Warning: Removed 12 rows containing missing values (`geom_line()`).
## Warning: Removed 12 rows containing missing values (`geom_point()`).
## Warning: Removed 12 rows containing non-finite values (`stat_bin()`).

There is still significant autocorrelation in these residuals, and the distribution looks left skewed. Therefore, the seasonal naive method is not appropriate for predicting the number of thefts over time.

Exploring the Unemployment Variable

This variable, taken from the Illinois government website, is the non-seasonally-adjusted unemployment rate in Chicago. Generally, there appears to be a decreasing trend in unemployment from January 2015 to December 2022. However, the pandemic again significantly changed the unemployment rate: there is a dip in 2020, followed by a sharp peak, likely from people losing their jobs during the pandemic. Since the data is not seasonally adjusted, there is a clear seasonal pattern of a spike every few months, particularly in January.

Original Variable

thefts %>%
  autoplot(`Unemployment Rate`)

Transformed using Box Cox

lambda <- thefts |>
  features(`Unemployment Rate`, features = guerrero) |>
  pull(lambda_guerrero)

thefts <- thefts %>%
  mutate(UnemploymentRateTransformed = box_cox(`Unemployment Rate`, lambda))
thefts %>%
  autoplot(UnemploymentRateTransformed)

The transformed unemployment rate looks similar, although not identical. The variance is somewhat more stable, but the major issue is still the spike in 2020 of unemployment.

thefts %>%
  gg_season(`Unemployment Rate`)

There typically appears to be a small spike in unemployment in January, with a peak during June. However, 2020 is again very different from the rest of the data: there is a gradual increase from February to March, followed by a sharp peak in April, and unemployment then remains relatively high throughout 2020.

Lag plots

The lag plots show a strong autocorrelation, so this data is likely not white noise.

thefts %>%
  gg_lag(`Unemployment Rate`, lags = 1:12)

thefts %>%
  gg_season(`Unemployment Rate`)

thefts %>%
  gg_subseries(`Unemployment Rate`)

From the subseries plot, we can see that typically unemployment was decreasing in all months, with the exception of 2020, where it remains consistently much higher.

For this variable, I have chosen to do cross validation to select the model, then do a smaller training and test set once the general model has been evaluated. Again, I expect worse performance in 2020 than in other years.

fc <- thefts %>%
  stretch_tsibble(.init = 3, .step = 1) %>%
  filter(.id < max(.id)) %>%
  model(
    snaive = SNAIVE(`Unemployment Rate`),
    lm = RW(`Unemployment Rate` ~ drift()),
    mean = MEAN(`Unemployment Rate`),
    naive = NAIVE(`Unemployment Rate`)
  ) %>% forecast(h = 3) |>
  group_by(.id) |>
  mutate(h = row_number()) |>
  ungroup() |>
  as_fable(response = "Unemployment Rate", distribution = `Unemployment Rate`)
fc %>%
  accuracy(thefts)
## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 2 observations are missing between 2022 Jan and 2022 Feb
## # A tibble: 4 × 10
##   .model .type      ME  RMSE   MAE     MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>   <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lm     Test   0.0846  1.75 0.816  -0.253  10.6 0.476 0.579 0.664
## 2 mean   Test  -1.74    2.87 2.35  -38.2    42.8 1.37  0.949 0.848
## 3 naive  Test  -0.0988  1.73 0.820  -3.01   10.9 0.478 0.572 0.663
## 4 snaive Test  -0.398   3.04 1.73  -11.8    24.0 1.01  1.00  0.827

From cross validation, the best performing model varies based on the metric. However, overall, the naive model and the random walk with drift appear to perform best. There is a linear trend across most of the data, and the seasonal pattern exists but does not appear to be very strong.
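Since the cross-validation forecasts above record the horizon h, the accuracy can also be broken out by horizon to see how quickly each method degrades as the forecast reaches further ahead. This sketch builds on the fc fable from above; the ggplot2 layers are one possible way to visualize the result.

```r
# Cross-validation accuracy by forecast horizon: compute accuracy grouped
# by h and model, then plot RMSE against horizon
fc %>%
  accuracy(thefts, by = c("h", ".model")) %>%
  ggplot(aes(x = h, y = RMSE, colour = .model)) +
  geom_line() +
  geom_point()
```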

thefts |>
  model(RW(`Unemployment Rate` ~ drift())) |>
  gg_tsresiduals()
## Warning: Removed 1 row containing missing values (`geom_line()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing non-finite values (`stat_bin()`).

With these residuals, the ACF plot overall doesn't show any significant autocorrelation. However, the residuals do appear skewed and not normally distributed, and again there is that sharp peak in 2020.